From Multivariate to Longitudinal Data
April 11, 2017
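Running example– a genes-by-cells scRNA-seq expression matrix (apparently the Deng et al. mouse preimplantation data), whose column names give each cell's developmental stage.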
> dim(deng)
[1] 17585 255
> sum(deng == 0) / prod(dim(deng))
[1] 0.5019552
> head(sample(colnames(deng)))
[1] "earlyblast" "16cell" "4cell" "midblast" "lateblast" "16cell"
> head(sample(rownames(deng)))
[1] "Gm7073" "Mir697" "Uqcrc2" "Ap2m1" "Slc10a3" "Ccr1"
\( \forall \) gene \( g \), fit an OLS regression on the known timepoint \( t \) per cell–
\[ \log_2(g + 1) \sim \beta_0 + \beta_1 t \]
\( \forall \) gene \( g \), fit the same regression on a permuted timepoint \( t^* \) per cell–
\[ \log_2(g + 1) \sim \beta_0 + \beta_1 t^* \]
\( \forall \) gene \( g \), fit the same regression on a permuted factor timepoint \( t^{**} \)–
\[ \log_2(g + 1) \sim \beta_0 + \beta_1 t^{**} \]
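A minimal sketch of the three fits (not the original analysis code; `deng` and `time` as above, object names illustrative):

y <- log2(deng + 1)
t_obs  <- as.numeric(time)                 # known timepoint per cell
t_perm <- sample(t_obs)                    # permuted timepoint t*
# one reading of the permuted *factor* timepoint t**: shuffle which stage
# maps to which rank, keeping cells of a stage together
t_fperm <- as.numeric(factor(time, levels = sample(levels(time))))

# slope p-value for gene g regressed on a timepoint vector t
# (zero-variance genes should be filtered first)
fit_gene <- function(g, t) {
  summary(lm(y[g, ] ~ t))$coefficients["t", "Pr(>|t|)"]
}

p_obs  <- vapply(rownames(y), fit_gene, numeric(1), t = t_obs)
p_perm <- vapply(rownames(y), fit_gene, numeric(1), t = t_perm)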
Given a matrix \( Y \) of \( D \) genes (features) by \( n \) samples (cells), determine a latent vector \( P \) of dimension \( 1 \) x \( n \) that reflects the (smooth) developmental trajectory of the \( n \) cells from the variance in the \( D \) genes.
Caution– \( R^2 \) against sorted time is a weak benchmark: even sorted random noise correlates strongly with the sorted timepoints.
> cor(runif(length(time)) %>% sort(), as.numeric(time) %>% sort())^2
[1] 0.8799669
\( Y \) is a \( D \) x \( n \) matrix. Compute the covariance matrix–
\[ \Sigma = E(YY^{T}) - \mu \mu^T \]
where \( \mu = E(Y) \)
Then compute the spectral decomposition of \( \Sigma \)
\[ \Sigma a_j = \lambda_j a_j \]
for \( j \in (1, ..., D) \)
Then the \( a_j \) are the eigenvectors of the covariance matrix \( \Sigma \) (the principal axes of the data matrix \( Y \)).
Note the ordering of \( j \) is meaningful–
\[ \lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D \geq 0 \]
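The decomposition written out in R (a sketch on a gene subset, since the full \( D \) x \( D \) covariance would be large):

Y     <- log2(deng[1:500, ] + 1)               # subset so Sigma fits in memory
Y_c   <- Y - rowMeans(Y)                       # remove mu = E(Y) per gene
Sigma <- tcrossprod(Y_c) / (ncol(Y) - 1)       # covariance estimate
eig   <- eigen(Sigma, symmetric = TRUE)
# eig$values are lambda_1 >= ... >= lambda_D; eig$vectors hold the a_j
pc1   <- drop(crossprod(eig$vectors[, 1], Y_c))  # cells projected onto a_1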
> irlba::prcomp_irlba()  # truncated SVD; fast for large matrices
> prcomp()               # SVD of the centered data matrix (numerically preferred)
> princomp()             # eigendecomposition of the covariance matrix
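One hypothetical construction of the `pca` object used below (the original call is not shown in these notes):

pca <- prcomp(log2(deng + 1))   # genes as rows, cells as columns
# rotation then has one row per cell, so rotation[, 1] places the
# 255 cells along the first principal axis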
> cor(pca$rotation[,1], as.numeric(time))^2
[1] 0.5933009
PCA by itself isn't satisfactory…
Ideas for improvements?
1) Quantifying uncertainty
2) Non-linear latent variable inference
The next several images are taken from slides by Guy Wolf (Yale)
\( Y \) is a \( D \) x \( n \) matrix. \( x \) is a \( d \) x \( n \) matrix (\( d < D \)). Want to relate \( x \) and \( Y \) and assume that the relationship is linear–
\[ Y = Wx + \epsilon \]
where \( W \) is a \( D \) x \( d \) matrix.
By convention (after centering the matrix \( Y \)),
\[ x \sim \mathcal{N} (0, I) \]
where \( I \) is the identity matrix of dimension \( d \) x \( d \).
Specify the error, \[ \epsilon \sim \mathcal{N} (0, \Psi) \]
where we assume \( \Psi \) to be a diagonal matrix \( D \) x \( D \).
Then for \( Y_i \), \( i \in (1, ..., n) \),
\[ Y_i | x_i \sim \mathcal{N} (W x_i, \Psi) \]
Then we can integrate out the latent variables–
\[ p( Y_i | W, \Psi)= \int p( Y_i | x_i , W, \Psi) p(x_i) dx_i \]
Under this specification, we can write the MVN distribution for our observations–
\[ Y_i \sim \mathcal{N} (0, C), \quad C = WW^T + \Psi \]
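A toy generative check of this marginal covariance (illustrative dimensions, not the expression data):

set.seed(1)
D <- 5; d <- 2; n <- 10000
W   <- matrix(rnorm(D * d), D, d)
Psi <- diag(runif(D, 0.1, 0.5))
x   <- matrix(rnorm(d * n), d, n)            # x ~ N(0, I_d)
eps <- t(MASS::mvrnorm(n, rep(0, D), Psi))   # eps ~ N(0, Psi)
Y   <- W %*% x + eps
max(abs(cov(t(Y)) - (W %*% t(W) + Psi)))     # should be near 0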
Young-Whittle Factor Analysis (1950s)
Let \( \psi_i \) denote the elements of the diagonal of \( \Psi \); add the constraint \( \psi_i = \sigma^2 \) (isotropic noise)
Assume \( \sigma^2 \) known; the MLE then yields the same \( W \) (up to rotation) as PCA
\[ \mathcal{L} = -\frac{n}{2}\left\{ D\log(2\pi) + \log|C| + \mathrm{tr}(C^{-1}\Sigma) \right\} \]
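The same log-likelihood as an R function (a sketch; `Sigma` is the sample covariance from above):

loglik <- function(W, sigma2, Sigma, n) {
  D <- nrow(W)
  C <- W %*% t(W) + sigma2 * diag(D)          # C = W W^T + sigma^2 I
  as.numeric(-(n / 2) * (D * log(2 * pi) +
                         determinant(C, logarithm = TRUE)$modulus +
                         sum(diag(solve(C, Sigma)))))
}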
library(pcaMethods)
# pca() is the single entry point; `method` selects the algorithm
resPCA  <- pca(data, method="svd",       center=FALSE, nPcs=5)  # classical PCA
resPPCA <- pca(data, method="ppca",      center=FALSE, nPcs=5)  # probabilistic PCA
resBPCA <- pca(data, method="bpca",      center=FALSE, nPcs=5)  # Bayesian PCA
resSVDI <- pca(data, method="svdImpute", center=FALSE, nPcs=5)  # SVD with imputation
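Each call returns a pcaRes object, so the fitted components come out the same way regardless of method:

head(scores(resPPCA))     # sample scores (one row per sample, nPcs columns)
head(loadings(resPPCA))   # variable loadings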
library(pseudogp)
# Bayesian GP latent variable model: smoothing_alpha/smoothing_beta set the
# prior on curve smoothness; iter/chains control the Stan sampler
fit <- fitPseudotime(data, smoothing_alpha = 30, smoothing_beta = 6,
                     iter = 1000, chains = 1)
posteriorBoxplot(fit)  # posterior uncertainty in each cell's pseudotime
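`gplvm_means` below is not defined in these notes; one way to obtain it, assuming the fitted Stan object exposes the pseudotimes as parameter "t":

pst <- rstan::extract(fit, pars = "t")$t   # posterior draws: iterations x cells
gplvm_means <- colMeans(pst)               # posterior mean pseudotime per cell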
Side by side, the GP-LVM posterior mean pseudotime tracks developmental time substantially better than the first principal component–
> cor(pca$rotation[,1], as.numeric(time))^2
[1] 0.5933009
> cor(gplvm_means, as.numeric(time))^2
[1] 0.783522